Ship 8 Tranche 5a: high-bit 4:2:0 RGBA u8 SIMD #25

Merged
uqio merged 5 commits into main from feat/ship8-rgba-high-bit-420-u8-simd on Apr 26, 2026

uqio (Collaborator) commented on Apr 26, 2026

Summary

Adds u8 RGBA SIMD across all 5 backends for high-bit 4:2:0 YUV (yuv420p9/10/12/14/16, p010/p012/p016) and wires them into the 8 high-bit u8 RGBA dispatchers in `src/row/mod.rs`. Builds on the scalar prep + dispatcher signatures landed in PR #24. The companion u16 RGBA SIMD work is deferred to Tranche 5b.

Changes

  • 5 SIMD backends — NEON / SSE4.1 / AVX2 / AVX-512 / wasm simd128 — each gain a const-generic `*_to_rgb_or_rgba_row<BITS, ALPHA>` template across 4 kernel families:

    • planar BITS-generic: `yuv_420p_n_to_rgb_or_rgba_row<BITS={9,10,12,14}, ALPHA>`
    • semi-planar BITS-generic: `p_n_to_rgb_or_rgba_row<BITS={10,12}, ALPHA>` (P016 has its own family)
    • 16-bit planar: `yuv_420p16_to_rgb_or_rgba_row`
    • 16-bit semi-planar: `p16_to_rgb_or_rgba_row`

    Existing RGB and new RGBA wrappers are thin shims over the shared template. Only the store (`vst3q_u8` vs `vst4q_u8`, `write_rgb_` vs `write_rgba_`) and the scalar tail dispatch branch on `ALPHA`; per-pixel math is unchanged.

  • 8 high-bit u8 RGBA dispatchers wired in `src/row/mod.rs` (`yuv420p9/10/12/14/16_to_rgba_row`, `p010/p012/p016_to_rgba_row`) — replace the prior `let _ = use_simd` stubs with the standard `cfg_select!` per-arch route block, mirroring the existing RGB dispatchers. `use_simd = false` still forces scalar.

  • Per-backend RGBA equivalence tests — ~30 new `#[test]` functions across the 5 backend test modules. Each new x86 test gates on `is_x86_feature_detected!` so the suite stays clean under sanitizer/Miri/non-feature-flagged CI runners.

  • Compile-time `const { assert!(BITS == ...) }` retained on every shared template (was already a Codex-flagged hardening from prior tranches).
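The shared-template idea described in the list above can be sketched roughly as follows. This is a scalar-only illustration, not the PR's kernels: the names `gray_to_rgb_or_rgba_row`, `gray10_to_rgb_row`, and `gray10_to_rgba_row` are hypothetical, the per-pixel math is a placeholder rather than a YUV conversion, and the opaque alpha value `0xFF` is an assumption.

```rust
/// Hypothetical shared kernel: one body serves both RGB and RGBA output.
/// `BITS` is the input sample depth; `ALPHA` selects 3- or 4-byte pixels.
fn gray_to_rgb_or_rgba_row<const BITS: u32, const ALPHA: bool>(src: &[u16], dst: &mut [u8]) {
    // Compile-time guard: instantiating with an unsupported depth fails to build.
    const { assert!(BITS == 9 || BITS == 10 || BITS == 12 || BITS == 14) };

    let bpp = if ALPHA { 4 } else { 3 };
    assert!(dst.len() >= src.len() * bpp, "dst too short");

    for (px, &s) in dst.chunks_exact_mut(bpp).zip(src) {
        // Placeholder math (NOT the real YUV->RGB conversion):
        // mask the sample to BITS bits, then drop the low bits to reach 8-bit.
        let v = ((s & ((1u16 << BITS) - 1)) >> (BITS - 8)) as u8;
        px[0] = v;
        px[1] = v;
        px[2] = v;
        // Only the store depends on ALPHA; the branch folds away at compile time.
        // The opaque alpha value is an assumption of this sketch.
        if ALPHA {
            px[3] = 0xFF;
        }
    }
}

/// Thin shim: RGB output (3 bytes per pixel).
pub fn gray10_to_rgb_row(src: &[u16], dst: &mut [u8]) {
    gray_to_rgb_or_rgba_row::<10, false>(src, dst)
}

/// Thin shim: RGBA output (4 bytes per pixel).
pub fn gray10_to_rgba_row(src: &[u16], dst: &mut [u8]) {
    gray_to_rgb_or_rgba_row::<10, true>(src, dst)
}
```

Because `ALPHA` is a const parameter, each shim monomorphizes to its own copy of the loop, so the RGB path should pay no runtime cost for the RGBA branch. An instantiation such as `gray_to_rgb_or_rgba_row::<11, true>` would be rejected at build time by the inline `const` assertion.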

Test plan

  • `cargo test --lib` on host (aarch64-darwin / NEON path): 485 pass, 0 fail
  • `cargo check --lib --target wasm32-unknown-unknown` clean
  • `cargo check --lib --target x86_64-unknown-freebsd` clean (incl. `--tests`)
  • `RUSTFLAGS="-Dwarnings" cargo clippy --lib --tests` clean
  • CI: ASAN sanitizer run on x86_64-linux (was failing before `is_x86_feature_detected!` guards were added; should now pass)
  • CI: Miri on x86_64-linux (was failing before guards; should now pass)
  • On-device equivalence run for AVX2 / AVX-512 / SSE4.1 hardware (deferred to CI)

🤖 Generated with Claude Code

Copilot AI left a comment

Pull request overview

This PR adds SIMD-backed support for high-bit-depth 4:2:0 RGBA (u8 output) row conversion paths across multiple architectures and wires them into the public row dispatch layer.

Changes:

  • Add `use_simd`-controlled dispatcher routing for high-bit 4:2:0 RGBA u8 conversions (YUV420p 9/10/12/14, P010/P012, YUV420p16, P016).
  • Implement RGBA SIMD entrypoints by reusing existing RGB kernels via shared `*_to_rgb_or_rgba_row` implementations with an `ALPHA` const parameter.
  • Add scalar↔SIMD byte-equivalence tests for the new RGBA SIMD paths across backends.
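The `use_simd`-controlled routing in the first bullet could look roughly like the sketch below. It makes several assumptions: plain `#[cfg]` attributes stand in for the `cfg_select!` block the PR mentions, only an x86_64 SSE4.1 route is shown, the function names are hypothetical, and the "SIMD" backend simply reuses the scalar body so that the routing logic stays the focus.

```rust
/// Scalar reference: expands one 8-bit gray sample to an opaque RGBA pixel.
/// Placeholder math, standing in for the real YUV->RGBA kernels.
fn scalar_to_rgba_row(src: &[u8], dst: &mut [u8]) {
    for (px, &s) in dst.chunks_exact_mut(4).zip(src) {
        px.copy_from_slice(&[s, s, s, 0xFF]);
    }
}

/// Hypothetical SSE4.1 backend. A real one would use intrinsics; this sketch
/// reuses the scalar body and only demonstrates the feature-gated entrypoint.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn sse41_to_rgba_row(src: &[u8], dst: &mut [u8]) {
    scalar_to_rgba_row(src, dst)
}

/// Public dispatcher: `use_simd = false` always forces the scalar path.
pub fn to_rgba_row(src: &[u8], dst: &mut [u8], use_simd: bool) {
    assert!(dst.len() >= src.len() * 4, "dst too short");
    if use_simd {
        #[cfg(target_arch = "x86_64")]
        {
            if std::arch::is_x86_feature_detected!("sse4.1") {
                // SAFETY: SSE4.1 support was confirmed at runtime just above.
                unsafe { sse41_to_rgba_row(src, dst) };
                return;
            }
        }
    }
    // Fallback for use_simd = false, other architectures, or missing features.
    scalar_to_rgba_row(src, dst)
}
```

Since both routes must produce identical bytes, calling `to_rgba_row` once with `use_simd = true` and once with `false` on the same input is a cheap way to exercise the byte-equivalence property the third bullet describes.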

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `src/row/mod.rs` | Wires high-bit 4:2:0 RGBA u8 dispatchers to per-arch SIMD implementations with scalar fallback. |
| `src/row/arch/x86_sse41.rs` | Adds SSE4.1 RGBA wrappers/shared kernels (ALPHA=true) for high-bit 4:2:0 and P010/P012/P016 families. |
| `src/row/arch/x86_sse41/tests.rs` | Adds SSE4.1 scalar equivalence tests for the new RGBA high-bit 4:2:0 paths. |
| `src/row/arch/x86_avx2.rs` | Adds AVX2 RGBA wrappers/shared kernels (ALPHA=true) for high-bit 4:2:0 and P010/P012/P016 families. |
| `src/row/arch/x86_avx2/tests.rs` | Adds AVX2 scalar equivalence tests for the new RGBA high-bit 4:2:0 paths. |
| `src/row/arch/x86_avx512.rs` | Adds AVX-512 RGBA wrappers/shared kernels (ALPHA=true) for high-bit 4:2:0 and P010/P012/P016 families. |
| `src/row/arch/x86_avx512/tests.rs` | Adds AVX-512 scalar equivalence tests for the new RGBA high-bit 4:2:0 paths. |
| `src/row/arch/wasm_simd128.rs` | Adds wasm simd128 RGBA wrappers/shared kernels (ALPHA=true) for high-bit 4:2:0 and P010/P012/P016 families. |
| `src/row/arch/wasm_simd128/tests.rs` | Adds wasm simd128 scalar equivalence tests for the new RGBA high-bit 4:2:0 paths. |
| `src/row/arch/neon.rs` | Adds NEON RGBA wrappers/shared kernels (ALPHA=true) for high-bit 4:2:0 and P010/P012/P016 families. |
| `src/row/arch/neon/tests.rs` | Adds NEON scalar equivalence tests for the new RGBA high-bit 4:2:0 paths. |

Comment on lines +1634 to +1638
fn check_planar_u8_sse41_rgba_equivalence_n<const BITS: u32>(
    width: usize,
    matrix: ColorMatrix,
    full_range: bool,
) {
Copilot AI Apr 26, 2026

The newly added RGBA equivalence helpers/tests call SSE4.1 intrinsics unconditionally (via `unsafe { yuv_420p_n_to_rgba_row... }`, etc.). Unlike the existing SSE4.1 tests earlier in this file, these helpers don’t gate execution with `std::arch::is_x86_feature_detected!("sse4.1")`, which can cause SIGILL on CPUs without SSE4.1 (and may also break under Miri if the detection would otherwise early-return). Add the same feature-detection guard (either in each helper or at the start of each `#[test]`).
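A guarded equivalence helper of the kind requested here might be shaped like the following sketch. Everything in it is hypothetical: `scalar_row`, `sse41_row`, and `check_sse41_equivalence` are illustrative names, the math is a placeholder, and the boolean return value (whether the comparison actually ran) is an addition of the sketch, not something the PR's helpers are known to do.

```rust
/// Scalar reference (placeholder math, not the real conversion).
pub fn scalar_row(src: &[u8], dst: &mut [u8]) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = s.saturating_add(1);
    }
}

/// Stand-in for an SSE4.1 kernel; a real one would use intrinsics.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn sse41_row(src: &[u8], dst: &mut [u8]) {
    scalar_row(src, dst)
}

/// Compares the scalar and SSE4.1 paths byte for byte.
/// Returns `false` without touching the SIMD path when SSE4.1 is unavailable.
#[cfg(target_arch = "x86_64")]
pub fn check_sse41_equivalence(width: usize) -> bool {
    // Guard first: never reach the feature-gated function on an unsupported CPU.
    if !std::arch::is_x86_feature_detected!("sse4.1") {
        return false;
    }
    let src: Vec<u8> = (0..width).map(|i| (i * 37) as u8).collect();
    let mut want = vec![0u8; width];
    let mut got = vec![0u8; width];
    scalar_row(&src, &mut want);
    // SAFETY: SSE4.1 support was confirmed by the guard above.
    unsafe { sse41_row(&src, &mut got) };
    assert_eq!(want, got, "SSE4.1 output diverges from scalar at width {width}");
    true
}

/// On other architectures there is nothing to compare, so the check is skipped.
#[cfg(not(target_arch = "x86_64"))]
pub fn check_sse41_equivalence(_width: usize) -> bool {
    false
}
```

Placing the guard inside the helper rather than in every `#[test]` keeps each test body to a single call, at the cost of tests reporting success when they were in fact skipped.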

    width: usize,
    matrix: ColorMatrix,
    full_range: bool,
) {
Copilot AI Apr 26, 2026

The new AVX2 RGBA equivalence helpers/tests invoke AVX2 intrinsics unconditionally. Existing AVX2 tests in this file early-return when `!std::arch::is_x86_feature_detected!("avx2")`; these new ones should do the same to avoid illegal-instruction crashes on non-AVX2 hosts (and to match the established test pattern in this module).

Suggested change
- ) {
+ ) {
+     if !std::arch::is_x86_feature_detected!("avx2") {
+         return;
+     }

    width: usize,
    matrix: ColorMatrix,
    full_range: bool,
) {
Copilot AI Apr 26, 2026

The new AVX-512 RGBA equivalence helpers/tests call AVX-512BW intrinsics unconditionally. Other AVX-512 tests in this file are guarded by `std::arch::is_x86_feature_detected!("avx512bw")`; without the same guard these tests can SIGILL on CPUs lacking AVX-512BW. Add the feature-detection early-return (in helpers or per-test) consistent with the rest of the file.

Suggested change
- ) {
+ ) {
+     if !std::arch::is_x86_feature_detected!("avx512bw") {
+         return;
+     }

al8n changed the title from "Feat/ship8 rgba high bit 420 u8 simd" to "Ship 8 Tranche 5a: high-bit 4:2:0 RGBA u8 SIMD" on Apr 26, 2026
uqio merged commit 10d3e17 into main on Apr 26, 2026
43 checks passed
uqio deleted the feat/ship8-rgba-high-bit-420-u8-simd branch on April 26, 2026 12:13